library(tidyverse)
library(tidymodels)
library(GGally)
library(plotly)
library(vip)
library(caret)
library(mgcv)
I checked that the dataset has no NULL or NA values.
I plotted histograms of the quantitative predictors (XN). Most predictor distributions were roughly Gaussian; some (X10, X11, X15) looked uniform.
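To back up the visual impression, a normality test can separate the Gaussian-looking predictors from the uniform-looking ones. A minimal sketch on simulated data (applying `shapiro.test` to `train_df` columns works the same way, up to its 5000-observation limit):

```r
set.seed(1)
# Shapiro-Wilk test: a small p-value means normality is rejected
p_gauss <- shapiro.test(rnorm(200))$p.value  # Gaussian sample
p_unif  <- shapiro.test(runif(200))$p.value  # uniform sample
p_unif < 0.05  # the uniform sample is clearly rejected as Gaussian
```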
I plotted a correlation matrix and observed strong correlations between several pairs of variables.
Additionally, I tried to compute the correlation ratio between the decision variable and the quantitative variables, but didn't succeed.
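For reference, the correlation ratio (eta) between a nominal and a quantitative variable is the square root of the between-group share of the total variance, so it can be computed directly in base R; a minimal sketch (`correlation_ratio` is a hypothetical helper, not from the original analysis):

```r
# Correlation ratio: sqrt(between-group sum of squares / total sum of squares)
correlation_ratio <- function(x, g) {
  g <- factor(g)
  grand <- mean(x)
  n_k <- tapply(x, g, length)  # group sizes
  m_k <- tapply(x, g, mean)    # group means
  sqrt(sum(n_k * (m_k - grand)^2) / sum((x - grand)^2))
}
# e.g. correlation_ratio(train_df$X7, train_df$decision)
```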
I plotted box plots of the quantitative variables (XN) against the response variable; in my opinion, box plots were quite useful for seeing whether the classes are separable.
Some variables showed a large difference in their means and quartiles when split by the decision response: X1, X3, X7, X9, X12. My assumption is that such variables might be useful for classification.
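The same "mean difference per class" screening can be done programmatically; a hedged sketch (`rank_by_class_gap` is an illustrative helper, not part of the original code; gaps are divided by the column's standard deviation so different scales are comparable):

```r
# Rank numeric columns by the standardized absolute difference of class means
rank_by_class_gap <- function(df, class_col = "decision") {
  num <- names(df)[sapply(df, is.numeric)]
  gaps <- sapply(num, function(v) {
    abs(diff(tapply(df[[v]], df[[class_col]], mean))) / sd(df[[v]])
  })
  sort(gaps, decreasing = TRUE)
}
# e.g. head(rank_by_class_gap(train_df))
```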
Then I drew scatter plots for all pairwise combinations of the XN variables. Some pairs visually seemed more separable than others; I plotted the most promising ones first and the rest later.
I plotted density and box plots using the caret library. These were the same types of plots as before, so no new insights, except that the density plots gave a better visualization; on the density plot, X15 seems promising.
I extracted feature importances from the random forest, which identified the following important features: X7, X12, X1, X9, X3, X15.
NB! Initially, I converted all nominal variables to dummy variables before the visualization section. I couldn't spot any dependencies between the nominal variables and the response variable.
Additionally, I had an issue maintaining a consistent recipe in the workflow: when I saved the workflow and read it back, dummy columns (e.g., pet_dog) were missing from the dataset, so I had to revert the dataset transformation before the visualization.
Next, I focused on the modeling part.
I split the data into 75% training and 25% testing and used cross-validation on the training set to evaluate model performance.
I started with the predictors I discovered and tried manually eliminating one of them from the formula to see the impact.
Eventually, I ended up with a GAM model.
The aim of the homework is to find a good classification method for the variable decision (with values yes and no) in terms of the other variables, when the cost of assigning yes to observations with actual value no is 3 times higher than the cost of assigning no to observations with actual value yes. The training data is in hw7_train.csv.
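Under this cost structure the theoretically optimal probability cutoff can be derived directly: predicting yes has expected cost 3·P(no|x), predicting no has expected cost 1·P(yes|x), so yes is the cheaper call only when P(yes|x) > 3/(3+1). A tiny sanity check (the empirical threshold search over the ROC curve later in the analysis should land near this value if the predicted probabilities are well calibrated):

```r
# Predict "yes" when 3 * (1 - p) < p, i.e. when p > 3 / 4
cost_fp <- 3  # cost of assigning yes when the truth is no
cost_fn <- 1  # cost of assigning no when the truth is yes
bayes_cutoff <- cost_fp / (cost_fp + cost_fn)
bayes_cutoff  # 0.75
```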
set.seed(39692)
df = read_delim("hw7_train.csv", delim = ",")
Rows: 4000 Columns: 17
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (4): country, pet, color, decision
dbl (13): X1, X2, X3, X4, X5, X6, X7, X9, X10, X11, X12, X15, X16
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#TODO: inconsistency in workflow recipe
#rec = recipe(decision~.,data=df)|>step_dummy(all_nominal_predictors())
#df_baked = rec |> prep(df) |> bake(new_data=NULL)
splitted=initial_split(df,prop=0.75, strata = decision)
train_df=training(splitted)
test_df=testing(splitted)
cv_data=vfold_cv(train_df,strata=decision, v=5)
summary(train_df)
X1 X2 X3 X4 X5 X6 X7 country X9
Min. :-3.5583 Min. :-3.3861 Min. : 0.3181 Min. :-0.1551 Min. :-4.53690 Min. : 0.9562 Min. :-3.260476 Length:2999 Min. :-2.7866
1st Qu.:-1.0472 1st Qu.:-1.4529 1st Qu.: 6.5414 1st Qu.: 5.1030 1st Qu.:-0.09534 1st Qu.: 6.6631 1st Qu.: 0.001293 Class :character 1st Qu.:-0.1106
Median :-0.4031 Median :-0.9401 Median : 8.2556 Median : 6.4398 Median : 1.19656 Median : 8.8490 Median : 0.772772 Mode :character Median : 0.5729
Mean :-0.4067 Mean :-0.9423 Mean : 8.2812 Mean : 6.4541 Mean : 1.19540 Mean : 8.8060 Mean : 0.766059 Mean : 0.5873
3rd Qu.: 0.2454 3rd Qu.:-0.4490 3rd Qu.: 9.9920 3rd Qu.: 7.8085 3rd Qu.: 2.43014 3rd Qu.:11.0031 3rd Qu.: 1.512864 3rd Qu.: 1.2942
Max. : 2.5902 Max. : 1.6795 Max. :15.9472 Max. :13.3344 Max. : 7.77809 Max. :16.2009 Max. : 4.285420 Max. : 3.9140
X10 X11 X12 pet color X15 X16 decision
Min. :-1.331 Min. :1.146 Min. : 0.9767 Length:2999 Length:2999 Min. :-12.374 Min. : 0.09739 Length:2999
1st Qu.: 1.684 1st Qu.:3.450 1st Qu.: 6.1363 Class :character Class :character 1st Qu.: -7.899 1st Qu.: 4.20628 Class :character
Median : 4.074 Median :5.347 Median : 7.5989 Mode :character Mode :character Median : -4.443 Median : 5.62692 Mode :character
Mean : 4.088 Mean :5.260 Mean : 7.6059 Mean : -4.458 Mean : 5.75933
3rd Qu.: 6.452 3rd Qu.:7.058 3rd Qu.: 9.0908 3rd Qu.: -1.008 3rd Qu.: 7.18794
Max. : 9.539 Max. :9.349 Max. :16.0386 Max. : 3.372 Max. :13.83966
head(train_df)
str(train_df)
spc_tbl_ [2,999 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ X1 : num [1:2999] 0.2655 0.0639 0.123 -0.2313 -0.7242 ...
$ X2 : num [1:2999] -1.138 -1.43 -1.604 -1.322 -0.996 ...
$ X3 : num [1:2999] 5.67 6.86 7.3 9.93 4.76 ...
$ X4 : num [1:2999] 3.1 4.95 4.14 8.73 6.36 ...
$ X5 : num [1:2999] 0.761 0.345 1.251 0.385 1.077 ...
$ X6 : num [1:2999] 4.48 8.47 8.57 7.79 9.89 ...
$ X7 : num [1:2999] -0.794 -0.958 0.668 0.531 -1.44 ...
$ country : chr [1:2999] "Latvia" "Estonia" "Latvia" "Estonia" ...
$ X9 : num [1:2999] -0.58 -0.415 0.955 2.67 2.234 ...
$ X10 : num [1:2999] 8.0725 8.9134 8.5025 -0.0426 0.1185 ...
$ X11 : num [1:2999] 2.03 5.72 7.54 5.23 6.2 ...
$ X12 : num [1:2999] 9.62 9.21 6.25 6.97 6.05 ...
$ pet : chr [1:2999] "snake" "dog" "cat" "hamster" ...
$ color : chr [1:2999] "red" "blue" "blue" "blue" ...
$ X15 : num [1:2999] -2.86 -10.55 -10.08 -2.57 -4.07 ...
$ X16 : num [1:2999] 5.48 2.35 6.92 4.64 4.63 ...
$ decision: chr [1:2999] "no" "no" "no" "no" ...
- attr(*, "spec")=
.. cols(
.. X1 = col_double(),
.. X2 = col_double(),
.. X3 = col_double(),
.. X4 = col_double(),
.. X5 = col_double(),
.. X6 = col_double(),
.. X7 = col_double(),
.. country = col_character(),
.. X9 = col_double(),
.. X10 = col_double(),
.. X11 = col_double(),
.. X12 = col_double(),
.. pet = col_character(),
.. color = col_character(),
.. X15 = col_double(),
.. X16 = col_double(),
.. decision = col_character()
.. )
- attr(*, "problems")=<externalptr>
The decision class probabilities we estimate from the training sample:
freq_table = table(train_df$decision)
probabilities <- freq_table / sum(freq_table)
print(probabilities)
no yes
0.4908303 0.5091697
probability_no = probabilities[1]
probability_yes = probabilities[2]
for (v in c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "X9", "X10", "X11", "X12", "X15", "X16")) {
  hist(train_df[[v]], main = paste("Distribution of", v), xlab = v)
}
ggcorr(train_df[1:14],
label = TRUE, # Add correlation values
label_round = 2, # Round values to 2 decimal places
hjust = 1, # Adjust label alignment
low = "blue", # Color for negative correlations
mid = "white", # Color for neutral correlations
high = "red", # Color for positive correlations
label_size = 2 # Increase label text size
)
Warning in ggcorr(train_df[1:14], label = TRUE, label_round = 2, hjust = 1, :
data in column(s) 'country', 'pet', 'color' are not numeric and were ignored
More or less promising predictors in terms of separability for the decision response:
Scatter plots for all pairwise combinations:
vars <- c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "X9", "X10", "X11", "X12", "X15", "X16")
for (pair in combn(vars, 2, simplify = FALSE)) {
  print(ggplot(train_df) + geom_point(aes(x = .data[[pair[1]]], y = .data[[pair[2]]], col = decision)))
}
table(train_df$pet)
cat dog hamster snake turtle
324 313 858 916 588
table(train_df$color)
blue green red
1503 589 907
table(train_df$country)
Estonia Latvia Lithuania
1476 1244 279
table(train_df$decision)
no yes
1472 1527
featurePlot(x = train_df[, 1:3],
            y = factor(train_df$decision),  # a factor response avoids the "NAs introduced by coercion" warnings seen with a character y
            plot = "pairs",
            ## Add a key at the top
            auto.key = list(columns = 3))
featurePlot(x = train_df[, c(1,2,3,4,5,6,7,9,10,11,12,15,16)],
y = factor(train_df$decision),
plot = "density",
## Pass in options to xyplot() to
## make it prettier
scales = list(x = list(relation="free"),
y = list(relation="free")),
adjust = 1.5,
pch = "|",
layout = c(4, 1),
auto.key = list(columns = 3))
Warning in draw.key(simpleKey(...), draw = FALSE) :
not enough rows for columns
featurePlot(x = train_df[, c(1,2,3,4,5,6,7,9,10,11,12,15,16)],
y = factor(train_df$decision),
plot = "box",
## Pass in options to bwplot()
scales = list(y = list(relation="free"),
x = list(rot = 90)),
layout = c(4,1 ),
auto.key = list(columns = 2))
# NB: skewed_vars (a character vector of skewed column names) must be defined beforehand;
# log1p() returns NaN for values below -1, which triggers the warnings below.
train_df <- train_df %>% mutate(across(all_of(skewed_vars), ~ log1p(.)))
Warning: There were 2 warnings in `mutate()`.
The first warning was:
ℹ In argument: `across(all_of(skewed_vars), ~log1p(.))`.
Caused by warning in `log1p()`:
! NaNs produced
ℹ Run dplyr::last_dplyr_warnings() to see the 1 remaining warning.
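The NaNs come from log1p() being applied to values below -1 (several predictors here take negative values). If a skew correction is still wanted for such columns, a signed-log transform is one option; a hedged sketch (`signed_log1p` is an illustrative helper, not part of the original analysis):

```r
# log1p(x) = log(1 + x) is undefined (NaN) for x < -1;
# a signed log preserves order and handles any real value:
signed_log1p <- function(x) sign(x) * log1p(abs(x))
signed_log1p(c(-2, 0, 2))  # symmetric around 0, no NaNs
```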
# Correlation heatmap (cor() needs numeric columns, so drop the character ones)
correlation_matrix <- train_df %>%
  select(where(is.numeric)) %>%
  cor()
summary(train_df)
X1 X2 X3 X4 X5 X6 X7 country X9
Min. :-3.5583 Min. :-3.3861 Min. : 0.3181 Min. :-0.1551 Min. :-4.53690 Min. : 0.9562 Min. :-3.260476 Length:2999 Min. :-2.7866
1st Qu.:-1.0472 1st Qu.:-1.4529 1st Qu.: 6.5414 1st Qu.: 5.1030 1st Qu.:-0.09534 1st Qu.: 6.6631 1st Qu.: 0.001293 Class :character 1st Qu.:-0.1106
Median :-0.4031 Median :-0.9401 Median : 8.2556 Median : 6.4398 Median : 1.19656 Median : 8.8490 Median : 0.772772 Mode :character Median : 0.5729
Mean :-0.4067 Mean :-0.9423 Mean : 8.2812 Mean : 6.4541 Mean : 1.19540 Mean : 8.8060 Mean : 0.766059 Mean : 0.5873
3rd Qu.: 0.2454 3rd Qu.:-0.4490 3rd Qu.: 9.9920 3rd Qu.: 7.8085 3rd Qu.: 2.43014 3rd Qu.:11.0031 3rd Qu.: 1.512864 3rd Qu.: 1.2942
Max. : 2.5902 Max. : 1.6795 Max. :15.9472 Max. :13.3344 Max. : 7.77809 Max. :16.2009 Max. : 4.285420 Max. : 3.9140
X10 X11 X12 pet color X15 X16 decision X1_X12
Min. :-1.331 Min. :1.146 Min. : 0.9767 Length:2999 Length:2999 Min. :-12.374 Min. : 0.09739 Length:2999 Min. :-19.528
1st Qu.: 1.684 1st Qu.:3.450 1st Qu.: 6.1363 Class :character Class :character 1st Qu.: -7.899 1st Qu.: 4.20628 Class :character 1st Qu.: -6.448
Median : 4.074 Median :5.347 Median : 7.5989 Mode :character Mode :character Median : -4.443 Median : 5.62692 Mode :character Median : -2.801
Mean : 4.088 Mean :5.260 Mean : 7.6059 Mean : -4.458 Mean : 5.75933 Mean : -1.662
3rd Qu.: 6.452 3rd Qu.:7.058 3rd Qu.: 9.0908 3rd Qu.: -1.008 3rd Qu.: 7.18794 3rd Qu.: 2.015
Max. : 9.539 Max. :9.349 Max. :16.0386 Max. : 3.372 Max. :13.83966 Max. : 33.444
X9_X12
Min. :-33.851
1st Qu.: -0.941
Median : 4.097
Mean : 2.727
3rd Qu.: 7.816
Max. : 23.803
formula = decision~s(X7) + s(X12) + s(X9) + s(X1) + s(X3) + s(X15) + s(X1_X12) + s(X9_X12)
rec=recipe(decision~.,data=train_df) |> step_dummy(all_nominal_predictors())
wf_gam=workflow()|>add_recipe(rec)|>
add_model(gen_additive_mod(mode="classification"),formula=formula)
res_tune_gam=tune_grid(wf_gam, resamples=cv_data, metrics =metric_set(mn_log_loss))
Warning: No tuning parameters have been detected, performance will be evaluated using the resamples with no tuning. Did you want to [tune()] parameters?
→ A | error: Not all variables in the recipe are present in the supplied training set: `X1_X12` and `X9_X12`.
There were issues with some computations A: x1
There were issues with some computations A: x2
There were issues with some computations A: x3
There were issues with some computations A: x5
There were issues with some computations A: x5
Warning: All models failed. Run `show_notes(.Last.tune.result)` for more information.
show_best(res_tune_gam,metric="mn_log_loss")
Error in `estimate_tune_results()`:
! All models failed. Run `show_notes(.Last.tune.result)` for more information.
Backtrace:
1. tune::show_best(res_tune_gam, metric = "mn_log_loss")
2. tune:::show_best.tune_results(res_tune_gam, metric = "mn_log_loss")
3. tune::.filter_perf_metrics(x, metric, eval_time)
4. tune::estimate_tune_results(x)
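The failure above most likely happens because X1_X12 and X9_X12 were added to train_df after cv_data was created, so the resamples do not contain them. One way to avoid this is to create such columns inside the recipe, so every resample regenerates them; a hedged sketch on toy data (assuming X1_X12 was the product X1 * X12 — the original construction is not shown):

```r
library(recipes)

# Toy data standing in for train_df
toy <- data.frame(X1 = c(1, 2), X12 = c(3, 4), decision = c("no", "yes"))

# step_mutate() computes the derived column during prep/bake,
# so it exists in every resample, not just in the mutated data frame
rec_int <- recipe(decision ~ ., data = toy) |>
  step_mutate(X1_X12 = X1 * X12)

baked <- rec_int |> prep(toy) |> bake(new_data = NULL)
names(baked)  # includes X1_X12
```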
predictions_gam <- augment(wf_final_gam, new_data = test_df)
predictions_gam$decision = factor(predictions_gam$decision)
decision_lvls = levels(predictions_gam$decision)
stopifnot(decision_lvls == c('no','yes'))
decision_yes_level = if(decision_lvls[2] == 'yes')'second'else'first'
stopifnot(decision_yes_level == 'second')
roc_gam=predictions_gam|>roc_curve(decision,.pred_yes,event_level=decision_yes_level)
plot1=autoplot(roc_gam)
plot1
Recall that the cost of assigning yes to observations with actual value no is 3 times higher than the cost of assigning no to observations with actual value yes.
min_loss_gam = roc_gam|>mutate(x=1-specificity,y=sensitivity,fn_value=3*probability_no*x+probability_yes*(1-y))|>filter(fn_value==min(fn_value))
print(min_loss_gam)
threashold_gam = min_loss_gam$.threshold
print(predictions_gam |> conf_mat(truth = decision, estimate = .pred_class))
predictions_after_cutoff_gam = predictions_gam |> mutate(.pred_class=factor(if_else(.pred_yes>threashold_gam,"yes","no"),levels=decision_lvls))
conf_matrix <- predictions_after_cutoff_gam |> conf_mat(truth = decision, estimate = .pred_class)
print(conf_matrix)
accuracy_score <- predictions_after_cutoff_gam |> accuracy(truth = decision, estimate = .pred_class)
print(accuracy_score)
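Since the task is cost-weighted, raw accuracy can be misleading here; the average misclassification cost is a more faithful summary of a confusion matrix. A hedged sketch (`expected_cost` is an illustrative helper, not part of the original code; it assumes a yardstick-style table with predictions in rows and truth in columns):

```r
# Average cost per observation: 3 per false "yes", 1 per false "no"
expected_cost <- function(tab, cost_fp = 3, cost_fn = 1) {
  (cost_fp * tab["yes", "no"] + cost_fn * tab["no", "yes"]) / sum(tab)
}
# e.g. expected_cost(conf_matrix$table)
```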
# Recipe
rec <- recipe(
decision ~X7+X12+X9+X1+X3+X15,
data = train_df)
# Workflow for Random Forest
wf_rf <- workflow() |>
add_recipe(rec) |>
add_model(rand_forest(mode = "classification", mtry = tune(), trees = 512))
# Tuning the Random Forest model
res_tune_rf <- tune_grid(
wf_rf,
resamples = cv_data,
metrics = metric_set(mn_log_loss),
grid = 4 # You can adjust the number of grid points if needed
)
show_best(res_tune_rf, metric = "mn_log_loss")
wf_final_rf <- finalize_workflow(wf_rf, select_best(res_tune_rf, metric='mn_log_loss')) |>
fit(data = train_df)
predictions_rf <- augment(wf_final_rf, new_data = test_df)
predictions_rf$decision = factor(predictions_rf$decision)
roc_rf=predictions_rf|>roc_curve(decision,.pred_yes,event_level=decision_yes_level)
plot_gam_rf = plot1+geom_path(data=roc_rf,aes(x=1-specificity,y=sensitivity),color="red")
plot_gam_rf
Again, the cost of assigning yes to observations with actual value no is 3 times higher than the cost of assigning no to observations with actual value yes.
min_loss_rf = roc_rf|>mutate(x=1-specificity,y=sensitivity,fn_value=3*probability_no*x+probability_yes*(1-y))|>filter(fn_value==min(fn_value))
print(min_loss_rf)
threashold_rf = min_loss_rf$.threshold
print(predictions_rf |> conf_mat(truth = decision, estimate = .pred_class))
predictions_after_cutoff_rf = predictions_rf |> mutate(.pred_class=factor(if_else(.pred_yes>threashold_rf,"yes","no"),levels=decision_lvls))
conf_matrix <- predictions_after_cutoff_rf |> conf_mat(truth = decision, estimate = .pred_class)
print(conf_matrix)
accuracy_score <- predictions_after_cutoff_rf |> accuracy(truth = decision, estimate = .pred_class)
print(accuracy_score)
library(randomForest)
# Fit the Random Forest model
rf_model <- randomForest(
decision ~ .,
data = recipe(decision~.,data=train_df) |> step_dummy(all_nominal_predictors()) |> prep(data=train_df) |> bake(new_data=NULL),
importance = TRUE, ntree = 500)
# Print the model summary
print(rf_model)
# Calculate feature importance
importance <- importance(rf_model)
importance_df <- data.frame(Feature = rownames(importance), Importance = importance[, 1])
# Print feature importance
print(importance_df)
ggplot(importance_df, aes(x = reorder(Feature, Importance), y = Importance)) +
geom_bar(stat = "identity") +
coord_flip() +
theme_minimal() +
labs(title = "Feature Importance for Random Forest Model",
x = "Feature",
y = "Importance")
rec <- recipe(decision ~ X7+X12+X9+X1+X3+X15, data = train_df)
# Workflow for SVM
wf_svm <- workflow() |>
add_recipe(rec) |>
add_model(
#svm_linear(mode="classification",cost=tune()) |> set_engine("kernlab",scaled=TRUE)
svm_poly(mode="classification",cost=tune(),engine="kernlab",degree=tune()) |> set_engine("kernlab", prob.model = TRUE)
)
cost_grid=expand_grid(cost=c(0.001, 0.01, 1),degree=2:3)
# Tuning the SVM model
res_tune_svm <- tune_grid(
wf_svm,
resamples = cv_data,
metrics = metric_set(mn_log_loss),
grid = cost_grid
)
show_best(res_tune_svm, metric = "mn_log_loss")
wf_final_svm <- finalize_workflow(
wf_svm,
select_best(res_tune_svm, metric = "mn_log_loss")
) |>
fit(data = train_df)
predictions_svm <- augment(wf_final_svm, new_data = test_df)
predictions_svm$decision = factor(predictions_svm$decision)
roc_svm=predictions_svm|>roc_curve(decision,.pred_yes,event_level=decision_yes_level)
plot_gam_rf+geom_path(data=roc_svm,aes(x=1-specificity,y=sensitivity),color="blue")
Again, the cost of assigning yes to observations with actual value no is 3 times higher than the cost of assigning no to observations with actual value yes.
min_loss_svm = roc_svm|>mutate(x=1-specificity,y=sensitivity,fn_value=3*probability_no*x+probability_yes*(1-y))|>filter(fn_value==min(fn_value))
print(min_loss_svm)
threashold_svm = min_loss_svm$.threshold
print(predictions_svm |> conf_mat(truth = decision, estimate = .pred_class))
predictions_after_cutoff_svm = predictions_svm |> mutate(.pred_class=factor(if_else(.pred_yes>threashold_svm,"yes","no"),levels=decision_lvls))
conf_matrix <- predictions_after_cutoff_svm |> conf_mat(truth = decision, estimate = .pred_class)
print(conf_matrix)
accuracy_score <- predictions_after_cutoff_svm |> accuracy(truth = decision, estimate = .pred_class)
print(accuracy_score)
sprintf("FINAL THRESHOLD: %f", threashold_gam)
sprintf("DECISION LVLS: %s", decision_lvls)
fitted_workflow <- finalize_workflow(wf_gam, select_best(res_tune_gam, metric='mn_log_loss')) |> fit(data = df)
my_predictions <- function(wf, new_data) {
  predictions <- augment(wf, new_data = new_data)
  predictions <- predictions |>
    mutate(.pred_class = factor(
      # threshold hardcoded from threashold_gam so the saved function is self-contained
      if_else(.pred_yes > 0.816406, "yes", "no"),
      levels = c('no', 'yes')
    ))
  return(predictions$.pred_class)
}
# save() does not honor argument names in `...`, so bind the required names first
model_classification <- fitted_workflow
prediction_function <- my_predictions
save(
  model_classification,
  prediction_function,
  file = "ploter_hw7.Rdata"
)